Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
نویسندگان
چکیده
We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finitestate transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50–100 seed paradigms, adding a 10million-word corpus reduces prediction error for morphological inflections by up to 10%.
منابع مشابه
Tree Structured Dirichlet Processes for Hierarchical Morphological Segmentation
This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process (HDP). Tree hierarchies are learned alon...
متن کاملContent-Based Musical Similarity Computation using the Hierarchical Dirichlet Process
We develop a method for discovering the latent structure in MFCC feature data using the Hierarchical Dirichlet Process (HDP). Based on this structure, we compute timbral similarity between recorded songs. The HDP is a nonparametric Bayesian model. Like the Gaussian Mixture Model (GMM), it represents each song as a mixture of some number of multivariate Gaussian distributions However, the number...
متن کاملHuman-Computer Interactive Chinese Word Segmentation: An Adaptive Dirichlet Process Mixture Model Approach
Previous research shows that Kalman filter based human-computer interactive Chinese word segmentation achieves an encouraging effect in reducing user interventions, but suffers from the drawback of incompetence in distinguishing segmentation ambiguities. This paper proposes a novel approach to handle this problem by using an adaptive Dirichlet process mixture model. By adjusting the hyperparame...
متن کاملHierarchical Dirichlet Processes
We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well...
متن کاملNonparametric Bayesian Clustering via Infinite Warped Mixture Models
We introduce a flexible class of mixture models for clustering and density estimation. Our model allows clustering of non-linearly-separable data, produces a potentially low-dimensional latent representation, automatically infers the number of clusters, and produces a density estimate. Our approach makes use of two tools from Bayesian nonparametrics: a Dirichlet process mixture model to allow a...
متن کامل